Identifying outliers is a very common task in data pre-processing. Outliers can distort how much weight a model gives to individual samples and, if not handled properly, can skew the result of any analysis. A simple method for identifying them uses the Interquartile Range.
What is the Interquartile Range?
The IQR (Interquartile Range) is the difference between the third and the first quartile of a distribution (that is, the 75th percentile minus the 25th percentile). It measures how wide the distribution is, since this range contains the middle half of the points in the dataset. It’s very useful for getting an idea of the shape of the distribution. For example, it is the length of the box in a boxplot.
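As a minimal sketch of this definition, the snippet below computes the IQR of a small made-up array. The values are purely illustrative, and it relies on the default behaviour of np.percentile, which interpolates linearly between data points.

```python
import numpy as np

# Illustrative values only, not taken from a real dataset
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# First and third quartiles (25th and 75th percentiles)
q1, q3 = np.percentile(data, [25, 75])

# The IQR is the distance between the two quartiles
iqr = q3 - q1
print(q1, q3, iqr)
# 3.0 7.0 4.0
```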
Outlier definition using IQR
Once we calculate it, we can use IQR to identify the outliers. We label a point as an outlier if it satisfies one of the following conditions:
- It’s greater than the 75th percentile + 1.5 × IQR
- It’s less than the 25th percentile - 1.5 × IQR
Applying this simple rule, we can easily detect the outliers of our distribution. A boxplot uses the same rule to draw the outliers as points beyond the whiskers.
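The two conditions can be worked through by hand on a tiny array. This is only a sketch: the data is invented for illustration, and the names lower_fence and upper_fence are labels chosen here for the two thresholds.

```python
import numpy as np

# Illustrative values: nine ordinary points and one obviously extreme one
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Thresholds taken from the two conditions
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# A point is an outlier if it falls outside either threshold
flagged = data[(data < lower_fence) | (data > upper_fence)]
print(lower_fence, upper_fence, flagged)
# -3.5 14.5 [100]
```

Only the value 100 lies outside the interval, so it is the only point labeled as an outlier.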
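The 1.5 coefficient comes from the normal distribution: for normally distributed data, the two thresholds fall about 2.7 standard deviations from the mean, so only about 0.7% of the points lie outside them. The general idea, though, is to identify outliers without relying on a measure that the outliers themselves can affect. That’s why using, for example, the standard deviation could lead to poor results. Quartiles and percentiles depend on the rank of the points rather than on their values, so they are much less sensitive to the presence of outliers.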
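To make the robustness argument concrete, here is a small sketch that adds a single extreme value to an invented array and compares how the standard deviation and the IQR react. The arrays and the helper iqr are assumptions made for this illustration.

```python
import numpy as np

# Illustrative data: five ordinary points, then the same points plus one extreme value
clean = np.array([10, 11, 12, 13, 14])
dirty = np.append(clean, 1000)

def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

# The standard deviation is dominated by the extreme value...
print("std:", clean.std(), "->", dirty.std())

# ...while the IQR barely moves
print("IQR:", iqr(clean), "->", iqr(dirty))
```

The standard deviation grows by more than two orders of magnitude, while the IQR only goes from 2.0 to 2.5. A threshold built on the standard deviation would be stretched by the very point it is supposed to catch.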
The idea is that if a point is too far from the 75th percentile (or from the 25th percentile), it’s a “strange” point that can be labeled as an outlier. The natural unit for measuring that distance is the IQR itself, which is why both thresholds are expressed as multiples of it.
Let’s see a simple example in Python.
Outlier detection in Python
In this example, we’ll generate some random points from a normal distribution, then artificially add two outliers to see whether the algorithm is able to spot them.
First, let’s import NumPy and set the seed of the random number generator.
import numpy as np
np.random.seed(0)
Now, let’s create our normally distributed dataset.
x = np.random.normal(size=100)
Let’s add two outliers, for example, -10 and 5.
x = np.append(x,[5,-10])
By default, np.random.normal draws from a standard normal distribution (mean 0 and variance 1), so these two numbers are very far from the mean and very unlikely. We can calculate their probability explicitly using the cumulative distribution function of the normal distribution, which is available in SciPy.
from scipy.stats import norm
The probability of having a value less than -10 is:
norm.cdf(-10)
# 7.61985302416047e-24
The probability of having a value greater than 5 is:
1-norm.cdf(5)
# 2.866515719235352e-07
So, these values are so rare and far from the mean that they can be considered outliers.
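For comparison, we can locate the IQR thresholds of an ideal standard normal distribution. The sketch below uses the theoretical quartiles from norm.ppf, not the sample quartiles of x, so the thresholds computed later from the data will differ somewhat. The variable names are chosen here for illustration.

```python
from scipy.stats import norm

# Theoretical quartiles of the standard normal distribution
q1 = norm.ppf(0.25)
q3 = norm.ppf(0.75)
iqr = q3 - q1

# Thresholds of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability that a standard normal point falls outside the thresholds
outside = norm.cdf(lower_fence) + norm.sf(upper_fence)
print(lower_fence, upper_fence, outside)
```

The thresholds sit at roughly -2.7 and 2.7, and about 0.7% of a standard normal population lies beyond them. Both -10 and 5 are well outside that interval.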
Now, let’s calculate the IQR:
iqr = np.percentile(x,75) - np.percentile(x,25)
Finally, we can create a True/False array mask to identify the outliers according to the original formula:
outliers_mask = (x > np.percentile(x,75) + 1.5*iqr) | (x < np.percentile(x,25) - 1.5*iqr)
As expected, the two points we added are exactly the ones flagged:
x[outliers_mask]
# array([ 5., -10.])
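The steps above can be wrapped in a small reusable function. This is a sketch rather than a reference implementation: the name iqr_outlier_mask and the parameter k are hypothetical, and the quartiles are computed once with a single np.percentile call instead of being recomputed in the mask expression.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask that is True where a value is an IQR outlier."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Rebuild the dataset used in this example: 100 normal points plus two outliers
np.random.seed(0)
x = np.append(np.random.normal(size=100), [5, -10])

flagged = x[iqr_outlier_mask(x)]
print(flagged)
```

Raising k makes the rule more conservative, so fewer points are flagged. Lowering it flags more.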
Conclusions
Dealing with outliers is always a problem for a data scientist. We can detect the presence of outliers using proper Exploratory Data Analysis, but if we want to label them correctly, we must apply a suitable algorithm. Although it only works on one variable at a time, outlier detection with the IQR is a simple and robust tool for any data scientist or analyst.